Module 1: Reading in and processing Word documents (Focus Group data)

Sourcing packages

  • The textract package is used to read in the .docx files.
  • The gensim package is used to fit preliminary LDA models on the data and to filter out words that are common to the majority of the identified topics.
  • The nltk package is used to get an initial list of stopwords and for word lemmatization.
In [14]:
import textract
import numpy as np
import scipy
import gensim
import os
import pandas as pd
import re
import math
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
#nltk.download('averaged_perceptron_tagger')
from collections import Counter
from matplotlib import pyplot as plt
from gensim import corpora, models
from itertools import repeat
%matplotlib inline

Definition of FocusGroup class

Instantiation

  • By giving the name of the Word file. The Word file should have the same format as the focus group documents, that is, each paragraph should be preceded by a line specifying the name (e.g. Parent 1) of the currently speaking person.

Attributes

  • raw_text: The raw text from the Word document.
  • parent_moderator_discussion: The part of raw_text which covers the discussion between parents and moderators. The rationale for separating the parent_moderator_discussion and within_moderator_discussion attributes is that in one case there was a discussion between only the moderators after the discussion between parents and moderators.
  • text_including_parents: An np.array storing the discussion between the parents and moderators. Each element of the np.array contains a paragraph from the discussion.
  • talkers_including_parents: An np.array of the same size as text_including_parents containing the respective talker's name (e.g. Parent 1).
  • within_moderator_discussion: The part of raw_text which covers the discussion between only the moderators, if available. This part of the text was separated from the parent/moderator discussion by two blank lines.
  • text_only_moderators: An np.array storing the discussion only between the moderators, if available. Each element of the np.array contains a paragraph from the discussion.
  • talkers_only_moderators: An np.array of the same size as text_only_moderators containing the respective talker's name (e.g. Moderator 1).
  • parent_list: List of unique parent participants.
  • moderator_list: List of unique moderator participants.

Methods

  • get_participant_text(participant): gets the list of paragraphs which belong to the given participant.
In [6]:
class FocusGroup:
    def __init__(self, filename):
        self.raw_text=str(textract.process('Data/FocusGroups/' + filename + ".docx")).replace('b\'', '').replace('\'', '')
        
        self.parent_moderator_discussion=self.raw_text.split('\\n\\n\\n')[0].split('\\n\\n')
        self.text_including_parents=np.array([parent_moderator_actual
                                    for parent_moderator_actual in self.parent_moderator_discussion 
                                    if not (('Parent'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Moderator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Administrator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or
                                        ('Speaker'==re.sub(r" [0-9]:","",parent_moderator_actual)))])
        self.talkers_including_parents=np.array([parent_moderator_actual.replace(':', '') 
                                    for parent_moderator_actual in self.parent_moderator_discussion 
                                    if (('Parent'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Moderator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Administrator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or
                                        ('Speaker'==re.sub(r" [0-9]:","",parent_moderator_actual)))])
        
        if len(self.raw_text.split('\\n\\n\\n'))>1:
            self.within_moderator_discussion=self.raw_text.split('\\n\\n\\n')[1].split('\\n\\n')
            self.text_only_moderators=np.array([parent_moderator_actual
                                    for parent_moderator_actual in self.within_moderator_discussion 
                                    if not (('Parent'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Moderator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Administrator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or
                                        ('Speaker'==re.sub(r" [0-9]:","",parent_moderator_actual)))])
            self.talkers_only_moderators=np.array([parent_moderator_actual.replace(':', '') 
                                    for parent_moderator_actual in self.within_moderator_discussion 
                                    if (('Parent'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Moderator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or 
                                        ('Administrator'==re.sub(r" [0-9]:","",parent_moderator_actual)) or
                                        ('Speaker'==re.sub(r" [0-9]:","",parent_moderator_actual)))])
        
        self.parent_list=[participant for participant in set(self.talkers_including_parents) if 'Parent' in participant]
        self.moderator_list=[participant for participant in set(self.talkers_including_parents) if 'Moderator' in participant]
        
        
    def get_participant_text(self, participant):
        if 'Parent' in participant:
            mask=[member==participant for member in self.talkers_including_parents]
            return list(self.text_including_parents[mask])
        elif 'Moderator' in participant:
            mask=[member==participant for member in self.talkers_including_parents]
            text_from_parent_discussion=self.text_including_parents[mask]
            
            if len(self.raw_text.split('\\n\\n\\n'))==1:
                return list(text_from_parent_discussion)
            else:
                mask=[member==participant for member in self.talkers_only_moderators]
                text_from_moderator_discussion=self.text_only_moderators[mask]
                return list(text_from_parent_discussion) + list(text_from_moderator_discussion)
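
The speaker-detection logic in __init__ can be exercised on a made-up toy transcript (the paragraph list below is illustrative, not from the data): stripping " <digit>:" from a label line leaves just the role name, which distinguishes labels from spoken paragraphs.

```python
import re

# Toy transcript in the layout the class expects: a speaker label line
# ("Parent 1:") followed by that speaker's paragraph.
paragraphs = ["Parent 1:", "We limited screen time.", "Moderator 1:", "How did that go?"]

def is_speaker_label(paragraph):
    # Same test as in FocusGroup.__init__.
    return re.sub(r" [0-9]:", "", paragraph) in ("Parent", "Moderator", "Administrator", "Speaker")

talkers = [p.replace(":", "") for p in paragraphs if is_speaker_label(p)]
texts = [p for p in paragraphs if not is_speaker_label(p)]
print(talkers)  # ['Parent 1', 'Moderator 1']
print(texts)    # ['We limited screen time.', 'How did that go?']
```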

Functions to process text

  • The original list of stopwords was augmented with stopwords that are filler words (for example, okay) or artifacts of the automated transcription (for example, inaudible); this extra list of stopwords was saved in the custom_stopwords list.
  • The WordNetLemmatizer() class of the nltk library was used for lemmatization.
  • The following data processing steps are performed by the text_processing_pipeline function:
    • Making the string lowercase
    • Removal of punctuation
    • Tokenization
    • Removal of texts with min_token_count tokens or fewer
    • Removing stopwords
    • Lemmatization
    • Removing stopwords (also after the lemmatization)
  • The output of the text processing pipeline is a list with two elements: the first element is the processed, tokenized text and the second element is the original text, kept to help with the interpretation of the results.
In [7]:
stopwords_list=stopwords.words('english')
custom_stopwords=['go','parent','say','0','yeah','would','okay','start','also','well','u','thank','inaudible','crosstalk','able','hear','actually','hi','oh','definitely','part','anything','sure','anyone','yes','thanks','everything','end','everybody','tand','administrator','whatever','sound','ti','moderator','though','mute','speak','silence','finish','bye','audio']
stopwords_list=stopwords_list+custom_stopwords
remove_stopwords_function=lambda tokenized_text, stopwords: [word for word in tokenized_text if word not in stopwords]
lemmatizer_instance=WordNetLemmatizer()
pos_tags_lemmatize_mapping_dict={'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}

def pos_mapping_function(pos_tag, dictionary=pos_tags_lemmatize_mapping_dict):
    if pos_tag[0] in ['N', 'V', 'J', 'R']:
        return dictionary[pos_tag[0]]
    else:
        return 'n'
    
def lemmatizer_function(text, dictionary=pos_tags_lemmatize_mapping_dict, pos_mapping_function=pos_mapping_function,
                       lemmatizer=lemmatizer_instance):
    pos_tags_for_lemmatize=[(word, pos_mapping_function(pos_tag)) for word, pos_tag in nltk.pos_tag(text)]
    pos_tags_lemmatized=[lemmatizer_instance.lemmatize(word, pos=pos_tag) for word, pos_tag in pos_tags_for_lemmatize]
    return pos_tags_lemmatized

def text_processing_pipeline(text_list,additional_stopwords, min_token_count=1, stopwords_list=stopwords_list, 
                             lemmatizer_function=lemmatizer_function, dictionary=pos_tags_lemmatize_mapping_dict,
                             pos_mapping_function=pos_mapping_function, lemmatizer=lemmatizer_instance):
    stopwords_list=stopwords_list+additional_stopwords
    lowercase_text_list=[text.lower() for text in text_list] #Making text lowercase
    lowercase_text_list=[re.sub(r"[^a-zA-Z0-9]", " ", text) for text in lowercase_text_list] #Removal of punctuation
    lowercase_text_list=[text.split() for text in lowercase_text_list] #Tokenization
    filtering_original_text=[text_list[i] for i in range(len(lowercase_text_list)) if len(lowercase_text_list[i])>min_token_count]
    lowercase_text_list=[text for text in lowercase_text_list if len(text)>min_token_count] #Keeping texts with more than min_token_count tokens
    lowercase_text_list=[remove_stopwords_function(text, stopwords_list) for text in lowercase_text_list] #Removing stopwords
    lowercase_text_list=[lemmatizer_function(text) for text in lowercase_text_list] #Lemmatization
    lowercase_text_list=[remove_stopwords_function(text, stopwords_list) for text in lowercase_text_list] #Removing stopwords
    return lowercase_text_list, filtering_original_text
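
Stopwords are removed both before and after lemmatization because lemmatization can map a token onto a stopword (for example, a lemmatizer turns "going" into "go", and "go" is in custom_stopwords). A minimal, self-contained sketch of the step ordering, with a stub lemmatizer standing in for WordNet so it runs without NLTK downloads:

```python
import re

stop_set = {"the", "was", "go"}  # "go" is also in custom_stopwords above

def stub_lemmatize(word):
    # Stand-in for WordNetLemmatizer: just strips a trailing "ing".
    return word[:-3] if word.endswith("ing") else word

text = "The kids were going outside."
tokens = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()  # lowercase, strip punctuation, tokenize
tokens = [t for t in tokens if t not in stop_set]            # first stopword pass removes "the"
tokens = [stub_lemmatize(t) for t in tokens]                 # lemmatization maps "going" -> "go"
tokens = [t for t in tokens if t not in stop_set]            # second pass catches the new "go"
print(tokens)  # ['kids', 'were', 'outside']
```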

Process the word data

  • Loop over the fifteen Word documents with the text processing function and save the results in a list with 15 elements.
  • The code cell below contains four lists of additional stopwords, one each for the Gaming, Low PIU, Media and Social groups. These additional stopwords were generated by Module 2 by iteratively running the gensim LDA algorithm and excluding the words which appeared in at least 3 of the 5 topics. The purpose of this processing step is to avoid having the same set of words in all topics.
  • The min_token_count argument of the text_processing_pipeline function was set to 60, so only paragraphs with more than 60 tokens were kept in the dataset.
In [8]:
file_list=['Gaming_Group1', 'Gaming_Group2', 'Gaming_Group3', 'Gaming_Group4',
           'LowPIU_Group1', 'LowPIU_Group2', 'LowPIU_Group3',
           'Media_Group1', 'Media_Group2', 'Media_Group3', 'Media_Group4',
           'Social_Group1', 'Social_Group2', 'Social_Group3', 'Social_Group4']
additional_stopword_counts=list(dict(Counter([re.sub('[0-9]', '', file,) for file in file_list])).values())
Gaming_group_stopwords=['like', 'get', 'school', 'hour', 'day', 'even', 'think', 'thing', 'way', 'know', 'year', 'week', 'really', 'one',
                       'kid', 'game', 'use', 'time', 'want', 'play', 'much', 'back']
Low_PIU_group_stopwords=['school', 'like', 'time', 'get', 'think', 'kid', 'really',
                        'thing', '00', 'technology', 'year', 'child', 'back', 'lot',
                        'even', 'know', 'want', 'old', 'one']
Media_group_stopwords=['like', 'thing', 'get', 'really', 'kid', 'time', 'want',
                      'school', 'think', 'know', 'one', 'use',
                      'year', 'much', 'back', 'work', 'person', 'pandemic',
                      'see', 'lot', 'good', 'little', 'day', 'old']
Social_group_stopwords=['like', 'get', 'think', 'know', 'thing', 'time', 'school',
                       'really', 'child', 'see', 'want',
                       'kid', 'one', 'lot', 'even']
additional_stopwords_list=[Gaming_group_stopwords, Low_PIU_group_stopwords, Media_group_stopwords, Social_group_stopwords]
additional_stopwords_list=[[stopword_list]*count for count, stopword_list in zip(additional_stopword_counts, additional_stopwords_list)]
additional_stopwords_list=[stopword for additional_stopword in additional_stopwords_list for stopword in additional_stopword]
all_focusgroup_text=[FocusGroup(focus_group_file) for focus_group_file in file_list]
all_focusgroup_processed_text=[text_processing_pipeline(focus_group.text_including_parents,additional_stopword_list, min_token_count=60) for focus_group, additional_stopword_list in zip(all_focusgroup_text, additional_stopwords_list)]
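
The replication logic above (each group's stopword list is repeated once per file of that group, then flattened one level) can be checked in isolation on a shortened, made-up file list:

```python
import re
from collections import Counter

demo_files = ['Gaming_Group1', 'Gaming_Group2', 'LowPIU_Group1']
demo_group_stopwords = [['game', 'play'], ['school', 'time']]  # Gaming, LowPIU

# Count files per group by stripping the trailing digit from each file name.
demo_counts = list(Counter(re.sub('[0-9]', '', f) for f in demo_files).values())
print(demo_counts)  # [2, 1]

# Repeat each group's list once per file, then flatten one level,
# yielding one stopword list per file.
per_file = [[lst] * c for c, lst in zip(demo_counts, demo_group_stopwords)]
per_file = [lst for group in per_file for lst in group]
print(per_file)  # [['game', 'play'], ['game', 'play'], ['school', 'time']]
```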

Module 2: Running the gensim LDA method iteratively to identify additional stopwords and the application of the Bertopic Dynamic topic modelling after filtering out the stopwords

Import classes to fit LDA with gensim and Bertopic

In [22]:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from bertopic import BERTopic

Functions to run the LDA method on the prepared data iteratively

  • The collapse_list_of_strings function is needed to ensure the compatibility of the required input from gensim (raw text instead of tokenized text) and the text processing pipeline from Module 1.
  • The identify_additional_stopwords_with_LDA function identifies the intersection of words across the different topics. If a particular word appeared in at least 3 out of the 5 topics in our example, it was marked as an additional stopword. We ran this iteratively until only a few (3 or 4) words remained common across the topics.
In [10]:
def collapse_list_of_strings(list_of_strings):
    return ' '.join(list_of_strings)

def identify_additional_stopwords_with_LDA(tokenized_text_list, number_of_topics, stopword_threshold):
    common_dictionary = Dictionary(tokenized_text_list)
    common_corpus = [common_dictionary.doc2bow(text) for text in tokenized_text_list]
    lda = LdaModel(common_corpus, num_topics=number_of_topics, id2word=common_dictionary, alpha='auto', eta='auto')
    most_common_words=Counter([term[0] for topic_num in range(number_of_topics) 
                                       for term in lda.show_topic(topicid=topic_num)]).most_common()
    additional_stopwords=[term[0] for term in most_common_words if term[1]>=stopword_threshold]
    return additional_stopwords
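
The threshold step at the end of identify_additional_stopwords_with_LDA is a frequency cut over topic membership. A self-contained sketch with made-up topic words (the real ones come from lda.show_topic):

```python
from collections import Counter

# Hypothetical top words for three topics (illustrative only).
topic_words = [['work', 'game', 'school'],
               ['work', 'friend', 'game'],
               ['work', 'teacher', 'game']]

counts = Counter(word for topic in topic_words for word in topic).most_common()
# Keep words that appear in at least stopword_threshold topics (here 3),
# mirroring the final comprehension in the function above.
stopword_candidates = [word for word, count in counts if count >= 3]
print(stopword_candidates)  # ['work', 'game']
```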

Run LDA model for Gaming Group

In [21]:
#Get the Gaming group data from the list of processed word files:
data=[text for gaming_group_text in all_focusgroup_processed_text[0:4] for text in gaming_group_text[0]]
#Prepare Bag of Words representation for LDA:
common_dictionary = Dictionary(data)
common_corpus = [common_dictionary.doc2bow(text) for text in data]
#Print out Words which were included in at least 3 out of the 5 topics:
print(identify_additional_stopwords_with_LDA(data, 5, 3))
#Fit the LDA model:
lda = LdaModel(common_corpus, num_topics=5, id2word=common_dictionary, alpha='auto', eta='auto')
#Check the distribution of topics:
topic_distribution=[lda.get_document_topics(bow=bow_text)[0][0] for bow_text in common_corpus]
print(Counter(topic_distribution))
#Show the most frequent words with their respective probability:
lda.show_topics()
['work', 'mean', 'lot', 'friend', 'teacher', 'little']
Counter({1: 67, 4: 54, 0: 52, 3: 49, 2: 35})
Out[21]:
[(0,
  '0.013*"watch" + 0.012*"daughter" + 0.011*"mean" + 0.010*"lot" + 0.010*"teacher" + 0.009*"see" + 0.008*"child" + 0.008*"show" + 0.007*"two" + 0.007*"work"'),
 (1,
  '0.014*"computer" + 0.012*"good" + 0.009*"issue" + 0.008*"old" + 0.007*"teacher" + 0.007*"still" + 0.007*"two" + 0.006*"maybe" + 0.006*"feel" + 0.006*"little"'),
 (2,
  '0.014*"friend" + 0.010*"mean" + 0.009*"00" + 0.009*"pandemic" + 0.008*"technology" + 0.007*"something" + 0.006*"child" + 0.006*"positive" + 0.006*"social" + 0.006*"old"'),
 (3,
  '0.011*"still" + 0.009*"feel" + 0.009*"00" + 0.008*"youtube" + 0.007*"child" + 0.007*"control" + 0.007*"technology" + 0.007*"pandemic" + 0.006*"computer" + 0.006*"home"'),
 (4,
  '0.011*"little" + 0.010*"lot" + 0.010*"teacher" + 0.009*"mean" + 0.009*"work" + 0.008*"something" + 0.008*"friend" + 0.008*"could" + 0.007*"00" + 0.007*"see"')]

Prepare data for dynamic topic modelling

  • We didn't have the exact dates for the Word files, so the values in the date_list variable are just for illustration. If focus group data from other dates become available in the future, this part can be reused with documents from different dates.
In [23]:
Gaming_group_data=[collapse_list_of_strings(text) for gaming_group_text in all_focusgroup_processed_text[0:4]
                                                  for text in gaming_group_text[0]]
number_of_tokens_per_text=[len(gaming_group_text[0]) for gaming_group_text in all_focusgroup_processed_text[0:4]]
date_list=['4/1/2020', '5/1/2020', '11/1/2020', '4/1/2021']
date_list_long=[list(repeat(date_list[i], number_of_tokens_per_text[i])) for i in range(4)]
date_list_long=[date for date_list in date_list_long for date in date_list]
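
The date replication above pairs each file's (illustrative) date with every processed paragraph from that file; note that number_of_tokens_per_text actually holds the number of processed paragraphs per file. The expansion in isolation:

```python
from itertools import repeat

dates = ['4/1/2020', '5/1/2020']
paragraphs_per_file = [3, 2]  # e.g. number of processed paragraphs in each file

# One date per paragraph: repeat each file's date, then flatten.
date_long = [list(repeat(d, n)) for d, n in zip(dates, paragraphs_per_file)]
date_long = [d for sub in date_long for d in sub]
print(date_long)  # ['4/1/2020', '4/1/2020', '4/1/2020', '5/1/2020', '5/1/2020']
```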

Fit the Bertopic model

In [24]:
topic_model=BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range=(1,1))
topics, probs = topic_model.fit_transform(Gaming_group_data)
topics_over_time = topic_model.topics_over_time(Gaming_group_data, topics, date_list_long)
2021-12-03 16:32:20,867 - BERTopic - Transformed documents to Embeddings
2021-12-03 16:32:35,521 - BERTopic - Reduced dimensionality with UMAP
2021-12-03 16:32:35,597 - BERTopic - Clustered UMAP embeddings with HDBSCAN
4it [00:02,  1.64it/s]
In [25]:
topic_model.visualize_topics_over_time(topics_over_time)
[Topics over Time plot, x-axis May 2020 to Mar 2021. Topics: -1_home_ipad_need_son, 0_friend_mean_video_social, 1_pandemic_technology_group_people, 2_computer_adhd_laptop_issue, 3_teacher_classroom_class_assignment, 4_watch_tv_daughter_12, 5_youtube_block_phone_roblox]
In [26]:
freq = topic_model.get_topic_info(); print(freq.head(10))
topic_model.visualize_barchart(topics=list(range(-1,6)), n_words=10)
   Topic  Count                                  Name
0     -1     73                 -1_home_ipad_need_son
1      0     49            0_friend_mean_video_social
2      1     37    1_pandemic_technology_group_people
3      2     35          2_computer_adhd_laptop_issue
4      3     28  3_teacher_classroom_class_assignment
5      4     19                4_watch_tv_daughter_12
6      5     16          5_youtube_block_phone_roblox
[Topic Word Scores bar charts for Topics -1 through 5]

Run LDA model for Low PIU Group

In [27]:
#Get the Low PIU group data from the list of processed word files:
data=[text for gaming_group_text in all_focusgroup_processed_text[4:7] for text in gaming_group_text[0]]
#Prepare Bag of Words representation for LDA:
common_dictionary = Dictionary(data)
common_corpus = [common_dictionary.doc2bow(text) for text in data]
#Print out Words which were included in at least 3 out of the 5 topics:
print(identify_additional_stopwords_with_LDA(data, 5, 3))
#Fit the LDA model:
lda = LdaModel(common_corpus, num_topics=5, id2word=common_dictionary, alpha='auto', eta='auto')
#Check the distribution of topics:
topic_distribution=[lda.get_document_topics(bow=bow_text)[0][0] for bow_text in common_corpus]
print(Counter(topic_distribution))
#Show the most frequent words with their respective probability:
lda.show_topics()
['use', 'see']
Counter({1: 42, 4: 28, 0: 25, 3: 18, 2: 14})
Out[27]:
[(0,
  '0.009*"way" + 0.008*"negative" + 0.008*"try" + 0.008*"make" + 0.008*"positive" + 0.007*"little" + 0.007*"impact" + 0.007*"bit" + 0.006*"son" + 0.006*"game"'),
 (1,
  '0.010*"people" + 0.009*"try" + 0.009*"pandemic" + 0.008*"daughter" + 0.008*"game" + 0.008*"take" + 0.008*"son" + 0.008*"play" + 0.008*"mean" + 0.007*"use"'),
 (2,
  '0.010*"take" + 0.009*"screen" + 0.009*"learn" + 0.008*"love" + 0.007*"friend" + 0.007*"still" + 0.007*"online" + 0.006*"different" + 0.006*"home" + 0.006*"talk"'),
 (3,
  '0.012*"use" + 0.011*"day" + 0.009*"focus" + 0.008*"covid" + 0.008*"computer" + 0.007*"concern" + 0.007*"home" + 0.007*"pandemic" + 0.006*"kind" + 0.006*"2"'),
 (4,
  '0.015*"1" + 0.010*"teacher" + 0.010*"remote" + 0.009*"use" + 0.008*"much" + 0.007*"maybe" + 0.007*"class" + 0.007*"work" + 0.007*"see" + 0.007*"3"')]

Prepare data for dynamic topic modelling

  • We didn't have the exact dates for the Word files, so the values in the date_list variable are just for illustration. If focus group data from other dates become available in the future, this part can be reused with documents from different dates.
In [29]:
Gaming_group_data=[collapse_list_of_strings(text) for gaming_group_text in all_focusgroup_processed_text[4:7]
                                                  for text in gaming_group_text[0]]
number_of_tokens_per_text=[len(gaming_group_text[0]) for gaming_group_text in all_focusgroup_processed_text[4:7]]
date_list=['4/1/2020', '5/1/2020', '11/1/2020', '4/1/2021']
date_list_long=[list(repeat(date_list[i], number_of_tokens_per_text[i])) for i in range(3)]
date_list_long=[date for date_list in date_list_long for date in date_list]
In [30]:
topic_model=BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range=(1,1))
topics, probs = topic_model.fit_transform(Gaming_group_data)
topics_over_time = topic_model.topics_over_time(Gaming_group_data, topics, date_list_long)
2021-12-03 16:38:20,756 - BERTopic - Transformed documents to Embeddings
2021-12-03 16:38:25,276 - BERTopic - Reduced dimensionality with UMAP
2021-12-03 16:38:25,319 - BERTopic - Clustered UMAP embeddings with HDBSCAN
3it [00:01,  2.58it/s]
In [31]:
topic_model.visualize_topics_over_time(topics_over_time)
[Topics over Time plot, x-axis Apr 2020 to Nov 2020. Topics: -1_home_mean_focus_teacher, 0_remote_class_teacher_classroom, 1_pandemic_group_information_insight, 2_game_computer_day_home]
In [32]:
freq = topic_model.get_topic_info(); print(freq.head(10))
topic_model.visualize_barchart(topics=list(range(-1,3)), n_words=10)
   Topic  Count                                  Name
0     -1     70            -1_home_mean_focus_teacher
1      0     21      0_remote_class_teacher_classroom
2      1     20  1_pandemic_group_information_insight
3      2     16              2_game_computer_day_home
[Topic Word Scores bar charts for Topics -1 through 2]

Run LDA model for Media Group

In [33]:
#Get the Media group data from the list of processed word files:
data=[text for gaming_group_text in all_focusgroup_processed_text[7:11] for text in gaming_group_text[0]]
#Prepare Bag of Words representation for LDA:
common_dictionary = Dictionary(data)
common_corpus = [common_dictionary.doc2bow(text) for text in data]
#Print out Words which were included in at least 3 out of the 5 topics:
print(identify_additional_stopwords_with_LDA(data, 5, 3))
#Fit the LDA model:
lda = LdaModel(common_corpus, num_topics=5, id2word=common_dictionary, alpha='auto', eta='auto')
#Check the distribution of topics:
topic_distribution=[lda.get_document_topics(bow=bow_text)[0][0] for bow_text in common_corpus]
print(Counter(topic_distribution))
#Show the most frequent words with their respective probability:
lda.show_topics()
[]
Counter({3: 35, 0: 35, 2: 34, 1: 32, 4: 31})
Out[33]:
[(0,
  '0.012*"give" + 0.009*"guy" + 0.008*"even" + 0.008*"computer" + 0.007*"today" + 0.007*"information" + 0.007*"feel" + 0.006*"book" + 0.006*"help" + 0.006*"friend"'),
 (1,
  '0.012*"watch" + 0.011*"two" + 0.008*"learn" + 0.008*"video" + 0.008*"way" + 0.008*"stuff" + 0.008*"week" + 0.007*"even" + 0.007*"class" + 0.007*"make"'),
 (2,
  '0.012*"computer" + 0.011*"technology" + 0.010*"concern" + 0.008*"bit" + 0.008*"watch" + 0.007*"experience" + 0.007*"teacher" + 0.007*"home" + 0.007*"share" + 0.006*"two"'),
 (3,
  '0.012*"remote" + 0.011*"home" + 0.010*"mean" + 0.009*"technology" + 0.009*"teacher" + 0.007*"class" + 0.007*"kind" + 0.007*"son" + 0.007*"come" + 0.006*"take"'),
 (4,
  '0.011*"medium" + 0.011*"social" + 0.008*"even" + 0.007*"mean" + 0.007*"computer" + 0.007*"talk" + 0.007*"teacher" + 0.006*"challenge" + 0.006*"hard" + 0.006*"youtube"')]

Prepare data for dynamic topic modelling

  • We didn't have the exact dates for the Word files, so the values in the date_list variable are just for illustration. If focus group data from other dates become available in the future, this part can be reused with documents from different dates.
In [34]:
Gaming_group_data=[collapse_list_of_strings(text) for gaming_group_text in all_focusgroup_processed_text[7:11]
                                                  for text in gaming_group_text[0]]
number_of_tokens_per_text=[len(gaming_group_text[0]) for gaming_group_text in all_focusgroup_processed_text[7:11]]
date_list=['4/1/2020', '5/1/2020', '11/1/2020', '4/1/2021']
date_list_long=[list(repeat(date_list[i], number_of_tokens_per_text[i])) for i in range(4)]
date_list_long=[date for date_list in date_list_long for date in date_list]
In [35]:
topic_model=BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range=(1,1))
topics, probs = topic_model.fit_transform(Gaming_group_data)
topics_over_time = topic_model.topics_over_time(Gaming_group_data, topics, date_list_long)
2021-12-03 16:40:49,423 - BERTopic - Transformed documents to Embeddings
2021-12-03 16:40:54,439 - BERTopic - Reduced dimensionality with UMAP
2021-12-03 16:40:54,498 - BERTopic - Clustered UMAP embeddings with HDBSCAN
4it [00:01,  2.74it/s]
In [36]:
topic_model.visualize_topics_over_time(topics_over_time)
[Topics over Time plot, x-axis May 2020 to Mar 2021. Topics: -1_teacher_zoom_control_class, 0_watch_social_youtube_video, 1_technology_computer_impact_write, 2_son_anxiety_adhd_teacher, 3_teacher_class_zoom_teach]
In [37]:
freq = topic_model.get_topic_info(); print(freq.head(10))
topic_model.visualize_barchart(topics=list(range(-1,4)), n_words=10)
   Topic  Count                                Name
0     -1     48       -1_teacher_zoom_control_class
1      0     47        0_watch_social_youtube_video
2      1     38  1_technology_computer_impact_write
3      2     20          2_son_anxiety_adhd_teacher
4      3     14          3_teacher_class_zoom_teach
[Topic Word Scores bar charts for Topics -1 through 3]

Run LDA model for Social Group

In [38]:
#Get the Social group data from the list of processed word files:
data=[text for gaming_group_text in all_focusgroup_processed_text[11:15] for text in gaming_group_text[0]]
#Prepare Bag of Words representation for LDA:
common_dictionary = Dictionary(data)
common_corpus = [common_dictionary.doc2bow(text) for text in data]
#Print out Words which were included in at least 3 out of the 5 topics:
print(identify_additional_stopwords_with_LDA(data, 5, 3))
#Fit the LDA model:
lda = LdaModel(common_corpus, num_topics=5, id2word=common_dictionary, alpha='auto', eta='auto')
#Check the distribution of topics:
topic_distribution=[lda.get_document_topics(bow=bow_text)[0][0] for bow_text in common_corpus]
print(Counter(topic_distribution))
#Show the most frequent words with their respective probability:
lda.show_topics()
['back', 'phone', 'pandemic', 'use', '1', '00']
Counter({1: 76, 0: 49, 4: 40, 2: 39, 3: 24})
Out[38]:
[(0,
  '0.013*"00" + 0.011*"home" + 0.010*"work" + 0.010*"year" + 0.009*"phone" + 0.007*"need" + 0.007*"room" + 0.007*"sit" + 0.007*"daughter" + 0.006*"back"'),
 (1,
  '0.013*"technology" + 0.010*"something" + 0.010*"work" + 0.009*"00" + 0.009*"need" + 0.008*"pandemic" + 0.007*"negative" + 0.007*"positive" + 0.007*"impact" + 0.006*"game"'),
 (2,
  '0.011*"phone" + 0.009*"use" + 0.008*"social" + 0.008*"way" + 0.008*"anxiety" + 0.008*"people" + 0.008*"pandemic" + 0.007*"back" + 0.006*"guy" + 0.006*"find"'),
 (3,
  '0.019*"use" + 0.011*"phone" + 0.011*"back" + 0.009*"social" + 0.009*"pandemic" + 0.009*"medium" + 0.007*"could" + 0.006*"game" + 0.006*"something" + 0.006*"find"'),
 (4,
  '0.015*"1" + 0.013*"friend" + 0.012*"much" + 0.011*"phone" + 0.010*"take" + 0.009*"pandemic" + 0.008*"back" + 0.008*"make" + 0.008*"social" + 0.008*"day"')]

Prepare data for dynamic topic modelling

  • We didn't have the exact dates for the Word files, so the values in the date_list variable are just for illustration. If focus group data from other dates become available in the future, this part can be reused with documents from different dates.
In [39]:
Gaming_group_data=[collapse_list_of_strings(text) for gaming_group_text in all_focusgroup_processed_text[11:15]
                                                  for text in gaming_group_text[0]]
number_of_tokens_per_text=[len(gaming_group_text[0]) for gaming_group_text in all_focusgroup_processed_text[11:15]]
date_list=['4/1/2020', '5/1/2020', '11/1/2020', '4/1/2021']
date_list_long=[list(repeat(date_list[i], number_of_tokens_per_text[i])) for i in range(4)]
date_list_long=[date for date_list in date_list_long for date in date_list]
In [40]:
topic_model=BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range=(1,1))
topics, probs = topic_model.fit_transform(Gaming_group_data)
topics_over_time = topic_model.topics_over_time(Gaming_group_data, topics, date_list_long)
2021-12-03 16:43:17,872 - BERTopic - Transformed documents to Embeddings
2021-12-03 16:43:22,612 - BERTopic - Reduced dimensionality with UMAP
2021-12-03 16:43:22,678 - BERTopic - Clustered UMAP embeddings with HDBSCAN
4it [00:02,  1.83it/s]
In [41]:
topic_model.visualize_topics_over_time(topics_over_time)
[Topics over Time plot, x-axis May 2020 to Mar 2021. Topics: -1_need_use_zoom_work, 0_work_sleep_day_home, 1_pandemic_technology_information_social, 2_phone_daughter_roblox_text, 3_phone_pandemic_covid_afraid]
In [42]:
freq = topic_model.get_topic_info(); print(freq.head(10))
topic_model.visualize_barchart(topics=list(range(-1,4)), n_words=10)
   Topic  Count                                      Name
0     -1     95                     -1_need_use_zoom_work
1      0     51                     0_work_sleep_day_home
2      1     44  1_pandemic_technology_information_social
3      2     27              2_phone_daughter_roblox_text
4      3     11             3_phone_pandemic_covid_afraid
[Topic Word Scores bar charts for Topics -1 through 3]

Module 3: Analysis of Crisis logger data with Word Cloud

CrisisLogger data consist of free-text responses from online surveys. Each row carries an upload_id tied to the session created when a participant logged their response to the COVID question. A KeyBERT model is applied to this text to extract key phrases from the 132 participant sessions.

In [43]:
import pandas as pd

root = "Data/"
df = pd.read_csv(root + 'CrisisLogger/crisislogger.csv')

# Transcriptions sharing an upload_id are joined into one string (e.g. ids 436 and 441 appear twice)
df = df.groupby(['upload_id'])['transcriptions'].apply(' '.join).reset_index()
df.head()
Out[43]:
upload_id transcriptions
0 10 so high our experience so far has been a littl...
1 209 I'm not going to stay in my name for the anony...
2 216 so far I have been florentines now for about a...
3 222 it has actually been a very difficult. Trying ...
4 228 so this whole situation has been strange for u...
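The groupby/join step above can be verified on a toy frame with a duplicated id (the ids and text below are made up):

```python
import pandas as pd

# Toy frame in which upload_id 436 appears twice (hypothetical data)
toy = pd.DataFrame({'upload_id': [436, 441, 436],
                    'transcriptions': ['first part', 'other session', 'second part']})

# Same pattern as above: join all transcriptions belonging to one id
merged = toy.groupby(['upload_id'])['transcriptions'].apply(' '.join).reset_index()
print(merged.loc[merged.upload_id == 436, 'transcriptions'].iloc[0])
# 'first part second part'
```

groupby preserves the within-group row order, so the two fragments are joined in the order they appear in the frame.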
In [46]:
from keybert import KeyBERT

mydoc = '. '.join(df.transcriptions)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(mydoc,
                                     keyphrase_ngram_range=(3,3), 
                                     use_mmr=True,
                                     diversity=0.2                                    
                                    )
keywords
Out[46]:
[('parent working home', 0.5365),
 ('daughter taking isolation', 0.529),
 ('difficult home teenage', 0.4776),
 ('working home difficult', 0.4858),
 ('able homeschool daughter', 0.45)]

The main themes are parents working from home, the challenge of caring for teenagers during this period, and difficult work environments.

Word Cloud

In [47]:
from wordcloud import WordCloud

# Join all transcriptions into one corpus, space-separated so words at
# document boundaries do not run together
corpus = ' '.join(df['transcriptions'])

from collections import Counter
swords = ['know', 'like', 'get', 'one', 'much', 'also', 'even', 'u', 'lot', 'go', 'way', 'day', 'see', 'really']

# Filter the extra stopwords out of the corpus
resultwords = [word for word in corpus.split(' ') if word not in swords]
result = ' '.join(resultwords)

# Count the remaining words (bound to a new name rather than shadowing the Counter class)
word_counts = Counter(resultwords)
most_occur = word_counts.most_common(30)
most_occur

wordcloud = WordCloud(width=400, height=400, background_color='white', min_font_size=6).generate(result)

import matplotlib.pyplot as plt
plt.figure(figsize = (8,8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()
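The counting step relies on collections.Counter; its behaviour on a toy token list:

```python
from collections import Counter

# Toy token list (made-up words) to illustrate most_common
tokens = ['time', 'family', 'time', 'home', 'time', 'family']
top = Counter(tokens).most_common(2)
print(top)
# [('time', 3), ('family', 2)]
```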

Module 4: Analysis of Prolific Academic data with Dynamic topic modelling from Bertopic

In [52]:
from sklearn.feature_extraction.text import CountVectorizer
In [48]:
# Forward slashes keep the paths portable across operating systems
df_p1 = pd.read_csv(root + 'ProlificAcademic/April 2020/Data/CRISIS_Parent_April_2020.csv')
df_p2 = pd.read_csv(root + 'ProlificAcademic/May 2020/Data/CRISIS_Parent_May_2020.csv')
df_p3 = pd.read_csv(root + 'ProlificAcademic/November 2020/Data/CRISIS_Parent_November_2020.csv')
df_p4 = pd.read_csv(root + 'ProlificAcademic/April 2021/Data/CRISIS_Parent_April_2021.csv')
In [50]:
df_pos_1 = df_p1[['timestamp1', 'specifypositive']].dropna()
df_pos_2 = df_p2[['timestamp1', 'specifypositive']].dropna()
df_pos_3 = df_p3[['timestamp1', 'specifypositive']].dropna()
df_pos_4 = df_p4[['timestamp1', 'specifypositive']].dropna()
df_combined = pd.concat([df_pos_1, df_pos_2, df_pos_3, df_pos_4])
In [53]:
positive_things = df_combined.specifypositive.to_list()
dates = df_combined['timestamp1'].apply(lambda x: pd.Timestamp(x)).to_list()
vectorizer_model = CountVectorizer(ngram_range=(1,2), stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model,
                       #min_topic_size=70,
                       n_gram_range=(1,2),
                       verbose=True)
topics, _ = topic_model.fit_transform(positive_things)
2021-12-03 17:01:11,550 - BERTopic - Transformed documents to Embeddings
2021-12-03 17:01:34,589 - BERTopic - Reduced dimensionality with UMAP
2021-12-03 17:01:34,851 - BERTopic - Clustered UMAP embeddings with HDBSCAN
In [54]:
topic_model.get_topic_info()
Out[54]:
Topic Count Name
0 -1 497 -1_quality time_quality_child_social
1 0 259 0_enjoys_enjoying_things enjoys_daughter
2 1 190 1_activities family_game nights_nights family_...
3 2 130 2_online_school closed_school work_private
4 3 118 3_family times_famiky time_time famiky_time fa...
... ... ... ...
75 74 12 74_playing outside_garden playing_outdoors gar...
76 75 11 75_time families_lot family_families spending_...
77 76 11 76_home schooling_schooling home_schooling 11_...
78 77 11 77_furloughed_furlough_furloughed home_pandemic
79 78 10 78_likes spend_time likes_likes_time talk

80 rows × 3 columns

In [55]:
topic_model.visualize_topics()
[Interactive plot: Intertopic Distance Map (axes D1 and D2) for Topics 0–78]
In [56]:
topics_over_time = topic_model.topics_over_time(positive_things, topics, dates,  nr_bins=40)
topic_model.visualize_topics_over_time(topics_over_time)
9it [00:54,  6.08s/it]
[Interactive plot: Topics over Time (May 2020 – May 2021), frequency of the 80 extracted topics, e.g. -1_quality time_quality_child_social, 0_enjoys_enjoying_things enjoys_daughter, 3_family times_famiky time_…]
In [ ]:
topic_model.get_topic(0)
In [ ]:
topic_model.get_topic(3)

The blue line (Topic 0) captures words about enjoying the lockdown. The plot shows that this topic peaked at the start of the lockdown in April 2020 and then gradually declined as families grew used to the new routine of being at home all the time. Topic 3 follows the same pattern: the extra family time was a welcome change in the early days, but by November 2020 this element appears far less frequently, i.e. its importance dropped over time.
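The frame returned by `topics_over_time` is a plain pandas DataFrame (its columns include `Topic`, `Frequency` and `Timestamp`), so a topic's raw trend can be read off without the plot. A sketch on a made-up stand-in frame with invented frequencies:

```python
import pandas as pd

# Toy stand-in for the frame returned by topic_model.topics_over_time()
toy = pd.DataFrame({'Topic': [0, 0, 3, 3],
                    'Frequency': [120, 40, 90, 25],
                    'Timestamp': pd.to_datetime(['2020-04-01', '2020-11-01',
                                                 '2020-04-01', '2020-11-01'])})

# Select one topic and order its bins chronologically
trend = toy[toy.Topic == 0].sort_values('Timestamp')
print(trend['Frequency'].tolist())  # [120, 40] -> declining over time
```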

In [ ]: